
C3D: Mitigating the NUMA Bottleneck via Coherent DRAM Caches



Abstract

Massive datasets prevalent in scale-out, enterprise, and high-performance computing are driving a trend toward ever-larger memory capacities per node. To satisfy the memory demands and maximize performance per unit cost, today's commodity HPC and server nodes tend to feature multi-socket shared-memory NUMA organizations. An important problem in these designs is the high latency of accessing memory on a remote socket, which degrades performance in workloads with large shared data working sets.

This work shows that emerging DRAM caches can help mitigate the NUMA bottleneck by filtering up to 98% of remote memory accesses. To be effective, these DRAM caches must be private to each socket to allow caching of remote memory, which comes with the challenge of ensuring coherence across multiple sockets and GBs of DRAM cache capacity. Moreover, the high access latency of DRAM caches, combined with high inter-socket communication latencies, can make hits to remote DRAM caches slower than main memory accesses. These features challenge existing coherence protocols optimized for on-chip caches with fast hits and modest storage capacity. Our solution to these challenges relies on two insights. First, keeping DRAM caches clean avoids the need to ever access a remote DRAM cache on a read. Second, a non-inclusive on-chip directory that avoids tracking blocks in the DRAM cache enables a lightweight protocol for guaranteeing coherence without the staggering directory costs. Our design, called Clean Coherent DRAM Caches (C3D), leverages these insights to improve performance by 6.4-50.7% in a quad-socket system versus a baseline without DRAM caches.
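The abstract's first insight can be illustrated with a small model. The sketch below is purely hypothetical (the paper's actual protocol and latencies are not given here): it models why, if every socket's DRAM cache is kept clean (memory always holds an up-to-date copy of every block), a local DRAM-cache miss can be served directly from main memory, skipping the slow probe of remote sockets' DRAM caches. All latency values and names are illustrative assumptions.

```python
# Hypothetical sketch of the clean-DRAM-cache read path (not the paper's code).
# Assumed latencies, in arbitrary cycles:
LOCAL_DCACHE_HIT = 40     # hit in this socket's DRAM cache
MEM_ACCESS = 90           # main-memory access
REMOTE_DCACHE_HIT = 140   # remote DRAM-cache hit (slower than memory!)

class Socket:
    """A socket with a private DRAM cache. Under a clean-cache policy,
    every cached block matches main memory, so no copy is ever dirty."""
    def __init__(self):
        self.dram_cache = {}  # block address -> data

def read(block, local, memory):
    """Return (data, latency) for a read issued on socket `local`.
    Because all DRAM caches are clean, a local miss never needs to
    probe remote DRAM caches: memory is guaranteed up to date."""
    if block in local.dram_cache:
        return local.dram_cache[block], LOCAL_DCACHE_HIT
    data = memory[block]              # fetch straight from main memory
    local.dram_cache[block] = data    # fill the local DRAM cache
    return data, MEM_ACCESS

memory = {0x100: "A"}
s0, s1 = Socket(), Socket()
s1.dram_cache[0x100] = "A"            # a remote socket also caches the block

data, lat = read(0x100, s0, memory)   # miss: served by memory, remote cache skipped
assert data == "A" and lat == MEM_ACCESS
data, lat = read(0x100, s0, memory)   # now a local DRAM-cache hit
assert lat == LOCAL_DCACHE_HIT
```

With dirty remote copies allowed, the miss path would instead have to probe remote DRAM caches at the assumed 140-cycle cost; the clean policy caps a read miss at main-memory latency, which is the point the abstract makes about remote DRAM-cache hits being slower than memory.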
